Adapted with permission from STAT 301 Project by: Justin Wong, Kevin Yu, Zhuoran (Serena) Feng, Fiona Chang
In this project, we perform a data analysis to determine the factors that affect the productivity of a garment factory. Using forward selection and LASSO, we compare different models and determine which factors best explain the actual productivity of the garment factory. Furthermore, we discuss the implications of our results and the limitations of the project, and we propose future questions that can be asked based on our project.
The trillion-dollar garment industry is a labor-intensive, low-skilled industry that is largely fueled by the production and performance of employees who work in manufacturing companies (Hamja, Maalouf, and Hasle 2019). Because the industry is driven by ever-changing consumer demands and fashion trends, manual processes remain unavoidable. Through statistical inference, we seek to examine the relationship between important attributes of the garment manufacturing process and its employees’ productivity by asking the following question: What factors affect the productivity of a garment factory?
The studies “Enhancing Efficiency and Productivity of Garment Industry by Using Different Techniques” (Rajput et al. 2018) and “The Effect of Lean on Occupational Health and Safety and Productivity in the Garment Industry” (Hamja, Maalouf, and Hasle 2019) will be used to help frame our exploration of this data set and to provide useful context on the garment industry.
The data set we will use, called Productivity Prediction of Garment Employees, is sourced from Kaggle.com (Siri 2021) and contains the following variables, which will guide us in answering our question:
- date: Date in MM-DD-YYYY
- quarter: A portion of a month, where each month was divided into 4 quarters
- department: Associated department
- day: Day of the week
- team: Associated team number
- targeted_productivity: Daily target productivity set by authority
- smv: Standard Minute Value; allocated time for a task
- wip: Work in progress; includes number of unfinished items for products
- over_time: amount of overtime by each team (minutes)
- incentive: amount of financial incentive in BDT (Bangladeshi currency)
- idle_time: amount of time where production was interrupted
- idle_men: number of workers idle due to interrupted production
- no_of_style_change: number of style changes
- no_of_workers: number of workers in given team
- actual_productivity: actual % of productivity delivered
From this list, date and team were excluded from our analysis because they are identifiers for the observations and hence not of interest for our research question.
| date | quarter | department | day | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1/1/2015 | Quarter1 | sweing | Thursday | 8 | 0.80 | 26.16 | 1108 | 7080 | 98 | 0 | 0 | 0 | 59.0 | 0.9407254 |
| 1/1/2015 | Quarter1 | finishing | Thursday | 1 | 0.75 | 3.94 | NA | 960 | 0 | 0 | 0 | 0 | 8.0 | 0.8865000 |
| 1/1/2015 | Quarter1 | sweing | Thursday | 11 | 0.80 | 11.41 | 968 | 3660 | 50 | 0 | 0 | 0 | 30.5 | 0.8005705 |
| department | day | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | actual_productivity | half |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sewing | Weekday | 0.80 | 26.16 | 1108 | 7080 | 98 | 0 | 0 | 0 | 0.9407254 | Half1 |
| finishing | Weekday | 0.75 | 3.94 | 0 | 960 | 0 | 0 | 0 | 0 | 0.8865000 | Half1 |
| sewing | Weekday | 0.80 | 11.41 | 968 | 3660 | 50 | 0 | 0 | 0 | 0.8005705 | Half1 |
Both of the categorical variables day and quarter were recoded because they have more than two levels, which leads to difficulties in conducting forward selection. Accordingly, day was collapsed into two levels: Weekday and Weekend. Similarly, quarter was used to create the variable half, which also has two levels.
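The recoding described above can be sketched in a few lines of Python. This is only an illustration, not the project's own code: the report does not state which days count as the weekend or which quarters fall into each half, so `WEEKEND_DAYS` and `FIRST_HALF_QUARTERS` are assumptions, and `recode_row` is a hypothetical helper. The handling of "sweing" and of a missing wip value mirrors what the raw and cleaned tables above appear to show.

```python
# Assumed recoding rules; the exact mapping used in the project is not stated.
WEEKEND_DAYS = {"Saturday", "Sunday"}            # assumption
FIRST_HALF_QUARTERS = {"Quarter1", "Quarter2"}   # assumption


def recode_row(row):
    """Return a cleaned copy of one raw observation stored as a dict."""
    cleaned = dict(row)
    # Collapse day of the week into two levels.
    cleaned["day"] = "Weekend" if row["day"] in WEEKEND_DAYS else "Weekday"
    # Replace the quarter label with a two-level half label.
    quarter = cleaned.pop("quarter")
    cleaned["half"] = "Half1" if quarter in FIRST_HALF_QUARTERS else "Half2"
    # The cleaned table shows "sewing" where the raw table has "sweing".
    if cleaned["department"].strip() == "sweing":
        cleaned["department"] = "sewing"
    # The cleaned table shows 0 where the raw wip value is missing (NA).
    if cleaned["wip"] is None:
        cleaned["wip"] = 0
    return cleaned


raw = {"quarter": "Quarter1", "department": "finishing", "day": "Thursday", "wip": None}
print(recode_row(raw))
```

Under these assumed rules, the second raw row shown above becomes a Weekday, Half1 observation with wip set to 0, matching the cleaned table.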
Figure 3.1: ggpairs Plot
From this plot, we can analyze the correlation values between the variables that we are using in our analysis. Based on the correlation values, there appears to be correlation between input variables, which will be addressed later in the analysis.
Variables with relatively high correlations (over 0.65) include no_of_workers and smv, no_of_workers and over_time, and over_time and smv.
These high correlations indicate that the data set has an issue of multicollinearity. To address this, the variable no_of_workers will be removed from our analysis.
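As a rough sketch of the screening step above, the snippet below computes pairwise Pearson correlations and flags pairs above the 0.65 cut-off. The helper names `pearson` and `high_correlation_pairs` are hypothetical, and the three values per variable are simply the first three raw rows shown earlier, which are far too few for a meaningful estimate and serve only to make the example runnable.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_correlation_pairs(columns, threshold=0.65):
    """List every pair of columns whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Illustration only: the first three raw rows from the table above.
columns = {
    "smv": [26.16, 3.94, 11.41],
    "over_time": [7080, 960, 3660],
    "no_of_workers": [59.0, 8.0, 30.5],
}
for pair in high_correlation_pairs(columns):
    print(pair)
```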
Figure 3.2: Actual Productivity by Day of the Week Boxplot
Figure 3.3: Actual Productivity by Half Boxplot
Figure 3.4: Actual Productivity of Departments Boxplot
The above boxplots show the medians and variances of the discrete factors of interest. Since the above plots show that there are differences in the medians and variances of the factors of interest, it is justified to keep these factors in our analysis for future investigation.
Figure 3.5: Distribution of Actual Productivity
The distribution of actual productivity does not appear to be normal and is slightly left-skewed, so our analysis will likely have to rely on an assumption of normality that the data do not clearly support.
Figure 3.6: Q-Q Plot of Actual Productivity
The above Q-Q plot was used to check whether a normality assumption on our response variable is valid, since normality is required for tests used later on. Both tails deviate from the reference line, which suggests left-skewness in the data. Unfortunately, the different transformations of the data that we tried did not improve the skewness seen in this Q-Q plot. Based on what we have learned in this class, we will therefore have to assume normality of the data even though it is a major stretch.
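To make the reading of a Q-Q plot concrete, the following sketch computes the points that such a plot displays: standard normal quantiles paired with standardized sample values. The function name `qq_points`, the plotting positions (i - 0.5)/n, and the five sample values are all assumptions chosen for illustration; they are not taken from the project data.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each standardized sample value with its theoretical normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    centre, spread = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        probability = (i - 0.5) / n  # assumed plotting position
        theoretical = standard_normal.inv_cdf(probability)
        points.append((theoretical, (value - centre) / spread))
    return points


# Hypothetical left-skewed sample: one very low value and a cluster of high ones.
sample = [0.25, 0.60, 0.75, 0.80, 0.85]
points = qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 3), round(observed, 3))
```

In this toy sample the smallest and largest standardized values both fall below their theoretical quantiles, which is the pattern a left-skewed variable tends to produce on a Q-Q plot.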
| half | department | count | mean | median | min | max | sd |
|---|---|---|---|---|---|---|---|
| Half1 | finishing | 296 | 0.7616317 | 0.8096023 | 0.2380417 | 1.096633 | 0.1823244 |
| Half1 | sewing | 399 | 0.7374970 | 0.7999632 | 0.2337055 | 1.100484 | 0.1522631 |
| Half2 | finishing | 210 | 0.7407145 | 0.7862321 | 0.2357955 | 1.120437 | 0.2159050 |
| Half2 | sewing | 292 | 0.7008552 | 0.7500680 | 0.2494167 | 1.000457 | 0.1559532 |
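This table provides relevant summaries of our data, split by half and department. Overall, the data seem to be relatively consistent, with a few things to note: mean and median productivity are somewhat lower in Half2 than in Half1 for both departments, the finishing department has a larger standard deviation than sewing in both halves, and the maximum exceeds 1 in every group.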
We consider the data set trustworthy and reliable because multiple published academic papers have used it (Al Imran et al. 2019; Imran, Rahim, and Ahmed 2021).
Using the data set, we plan to analyze what factors are the most important in productivity. Linear regression will be used to determine the best inference model for the actual productivity of the factory. Using forward selection and LASSO, we plan to compare different models and determine which factors are the best in explaining relationships between the factors and the actual productivity of the garment factory. Additionally, we plan to test our optimal inference model’s performance by splitting the data into training and testing and comparing the corresponding adjusted \(R^2\) values with the full model.
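The training/testing split mentioned above could look roughly like the sketch below. The 75% training fraction and the seed are assumptions: the report does not state them, although a 75% share of the 1197 observations counted in the summary table gives 897 rows, which is consistent with the residual degrees of freedom reported in the model tables of this report. `train_test_split` here is a hypothetical helper, not a library function.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=301):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 1197 observations in the cleaned data.
train_rows, test_rows = train_test_split(range(1197))
print(len(train_rows), len(test_rows))
```

Model selection is then carried out on the training rows only, and the testing rows are held back to check how the chosen model performs on data it has not seen.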
We expect that factors such as number of members on the team and targeted productivity may have a higher association with actual productivity. Thus, we expect these factors to be present in the best model for explaining the relationship with the actual productivity.
The results from this report could provide insights to companies in the garment manufacturing sector. Having knowledge on what factors may increase productivity is crucial for any successful business.
| department | day | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | actual_productivity | half |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sewing | Weekend | 0.75 | 18.79 | 1193 | 3960 | 45 | 0 | 0 | 0 | 0.7506510 | Half1 |
| sewing | Weekend | 0.80 | 26.16 | 1128 | 10620 | 63 | 0 | 0 | 0 | 0.8001171 | Half2 |
| finishing | Weekend | 0.75 | 3.94 | 0 | 1620 | 0 | 0 | 0 | 0 | 0.9617845 | Half2 |
| department | day | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | actual_productivity | half |
|---|---|---|---|---|---|---|---|---|---|---|---|
| finishing | Weekday | 0.75 | 3.94 | 0 | 960 | 0 | 0 | 0 | 0 | 0.7551667 | Half1 |
| sewing | Weekday | 0.75 | 19.87 | 733 | 6000 | 34 | 0 | 0 | 0 | 0.7530975 | Half1 |
| finishing | Weekday | 0.65 | 3.94 | 0 | 960 | 0 | 0 | 0 | 0 | 0.7059167 | Half1 |
| n_input_variables | RSQ | RSS | ADJ.R2 |
|---|---|---|---|
| 1 | 0.1945884 | 21.49255 | 0.1936885 |
| 2 | 0.2164887 | 20.90814 | 0.2147359 |
| 3 | 0.2281308 | 20.59747 | 0.2255377 |
| 4 | 0.2310129 | 20.52056 | 0.2275645 |
| 5 | 0.2337486 | 20.44755 | 0.2294487 |
| 6 | 0.2377994 | 20.33946 | 0.2326610 |
| 7 | 0.2384121 | 20.32311 | 0.2324154 |
| 8 | 0.2389621 | 20.30843 | 0.2321059 |
| 9 | 0.2392325 | 20.30121 | 0.2315134 |
| 10 | 0.2394983 | 20.29412 | 0.2309148 |
| 11 | 0.2394995 | 20.29409 | 0.2300469 |
Adjusted \(R^2\) was chosen as the metric for model selection because it is well suited to comparing inference models of different sizes. Unlike \(R^2\), which can only increase as the RSS shrinks with every added variable, adjusted \(R^2\) penalizes the number of input variables, making it a more suitable metric than \(R^2\).
As shown by the table above, the model with 6 input variables has the highest adjusted \(R^2\), so this model is chosen as the optimal model. However, its adjusted \(R^2\) value is only marginally higher than that of many of the other models, suggesting that it may be only slightly better than the others. The selected model will be compared with the full model to test this observation.
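The selection rule can be checked by hand from the RSQ column of the table above, using the standard formula adjusted \(R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)\). The sketch below assumes a training sample of n = 897, a figure inferred from the residual degrees of freedom in the F-test table of this report rather than stated explicitly, and `adjusted_r2` is a hypothetical helper name.

```python
N_TRAIN = 897  # assumption: residual df of 890 plus 7 coefficients in the 6-variable model

# RSQ values copied from the forward selection table, keyed by number of input variables.
RSQ = {
    1: 0.1945884, 2: 0.2164887, 3: 0.2281308, 4: 0.2310129,
    5: 0.2337486, 6: 0.2377994, 7: 0.2384121, 8: 0.2389621,
    9: 0.2392325, 10: 0.2394983, 11: 0.2394995,
}


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p input variables fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


adjusted = {p: adjusted_r2(r2, N_TRAIN, p) for p, r2 in RSQ.items()}
best_size = max(adjusted, key=adjusted.get)
print(best_size, round(adjusted[best_size], 4))
```

Under that assumption the computed values agree with the ADJ.R2 column, and the maximum falls at six input variables even though \(R^2\) itself keeps rising up to eleven.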
The variables selected in the model with 6 input variables were: targeted_productivity, smv, wip, incentive, idle_men, and no_of_style_change.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.2425582 | 0.0393780 | 6.159742 | 0.0000000 |
| targeted_productivity | 0.6992923 | 0.0520280 | 13.440690 | 0.0000000 |
| smv | -0.0012414 | 0.0005188 | -2.392951 | 0.0169199 |
| wip | 0.0000076 | 0.0000035 | 2.174850 | 0.0299041 |
| incentive | 0.0000677 | 0.0000357 | 1.896537 | 0.0582127 |
| idle_men | -0.0066583 | 0.0015499 | -4.296067 | 0.0000193 |
| no_of_style_change | -0.0369644 | 0.0127146 | -2.907235 | 0.0037369 |
Adjusted \(R^2\) for Selected Model: 0.2327
Figure 3.7: Residuals of Selected Model
Figure 3.8: QQ-plot of Selected Model
The residual plot and the Q-Q plot suggest some violations of the assumptions needed for our analysis. The residual plot shows slight heteroskedasticity within our model, which violates our equal variance assumption, and the Q-Q plot suggests a violation of our normality assumption.
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.2436616 | 0.0407603 | 5.9779105 | 0.0000000 | 0.1636634 | 0.3236597 |
| departmentsewing | 0.0169580 | 0.0219195 | 0.7736501 | 0.4393443 | -0.0260623 | 0.0599783 |
| dayWeekend | 0.0061356 | 0.0108010 | 0.5680563 | 0.5701408 | -0.0150630 | 0.0273341 |
| targeted_productivity | 0.6988608 | 0.0526825 | 13.2655250 | 0.0000000 | 0.5954636 | 0.8022580 |
| smv | -0.0018777 | 0.0009822 | -1.9117663 | 0.0562289 | -0.0038055 | 0.0000500 |
| wip | 0.0000071 | 0.0000036 | 1.9496519 | 0.0515331 | 0.0000000 | 0.0000142 |
| over_time | -0.0000001 | 0.0000021 | -0.0364780 | 0.9709094 | -0.0000043 | 0.0000041 |
| incentive | 0.0000662 | 0.0000359 | 1.8430641 | 0.0656539 | -0.0000043 | 0.0001367 |
| idle_time | 0.0003560 | 0.0004497 | 0.7916867 | 0.4287555 | -0.0005266 | 0.0012387 |
| idle_men | -0.0077306 | 0.0020215 | -3.8242258 | 0.0001404 | -0.0116981 | -0.0037632 |
| no_of_style_change | -0.0351849 | 0.0135110 | -2.6041730 | 0.0093640 | -0.0617022 | -0.0086676 |
| halfHalf2 | -0.0057576 | 0.0104672 | -0.5500580 | 0.5824184 | -0.0263009 | 0.0147858 |
Adjusted \(R^2\) for Full Model: 0.23
The adjusted \(R^2\) of the selected model (0.2327) is only slightly larger than that of the full model (0.23). This further suggests that the selected model is not meaningfully better than the full model. An F-test will be conducted to test this observation.
| Res.Df | RSS | Df | Sum.of.Sq | F | Pr..F. |
|---|---|---|---|---|---|
| 890 | 20.33946 | NA | NA | NA | NA |
| 885 | 20.29409 | 5 | 0.0453663 | 0.3956736 | 0.8519733 |
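Since the p-value is 0.852, at a 5% significance level we fail to reject the null hypothesis that the coefficients of the five variables excluded from the selected model are all zero. In other words, there is not enough evidence that the full model fits the data better than the selected model, so the two models are not shown to be significantly different.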
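The F statistic in the table above can be reproduced from the two RSS values and their residual degrees of freedom with the usual nested-model formula. The sketch below is a worked check rather than the project's code, `nested_f_statistic` is a hypothetical helper name, and the p-value is not recomputed because that would require the F distribution.

```python
def nested_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic comparing a reduced model with a full model that contains it."""
    extra_terms = df_reduced - df_full              # coefficients only the full model has
    numerator = (rss_reduced - rss_full) / extra_terms
    denominator = rss_full / df_full                # residual variance estimate from the full model
    return numerator / denominator


# Values from the table above: selected (reduced) model versus full model.
f_forward = nested_f_statistic(20.33946, 890, 20.29409, 885)
print(round(f_forward, 3))
```

A value well below 1 means that the five extra variables reduce the RSS by less per variable than the residual variance, which is why the corresponding p-value is so large.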
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.2689469 | 0.0788155 | 3.412359 | 0.0007347 |
| targeted_productivity | 0.6689556 | 0.1052842 | 6.353807 | 0.0000000 |
| smv | -0.0011865 | 0.0009566 | -1.240232 | 0.2158819 |
| wip | 0.0000077 | 0.0000067 | 1.138709 | 0.2557548 |
| incentive | 0.0000563 | 0.0000465 | 1.210549 | 0.2270439 |
| idle_men | -0.0099292 | 0.0030339 | -3.272772 | 0.0011924 |
| no_of_style_change | -0.0285627 | 0.0243788 | -1.171623 | 0.2423002 |
Adjusted \(R^2\) for Selected Model Using Testing Data: 0.1702
The adjusted \(R^2\) suggests that about 17% of the variation in the response is explained by the model after adjusting for the number of predictors. This indicates that the model performs fairly poorly. However, since it was the best model according to our analysis, this may suggest that further exploration of this topic is needed. Thus, model selection using LASSO is conducted next.
Figure 3.9: Lambda Selection by CV with LASSO
The variables selected in the model by LASSO were: targeted_productivity, smv, incentive, idle_men, and no_of_style_change.
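To illustrate why LASSO can leave a variable out entirely, the sketch below runs a bare-bones coordinate descent with soft-thresholding on a tiny made-up data set. Everything here is an assumption for illustration: the data are hypothetical, centred, and have no intercept, the penalty `lam` is fixed by hand instead of being chosen by cross-validation as in Figure 3.9, and this is not the routine used in the project.

```python
def soft_threshold(value, lam):
    """Shrink a value toward zero by lam, returning exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, n_sweeps=100):
    """Minimise RSS/(2n) + lam * sum(|b_j|) for centred data without an intercept."""
    n = len(y)
    coefs = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Partial residual: remove the contribution of every predictor except j.
            partial = [
                y[i] - sum(coefs[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(col[i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in col) / n
            coefs[j] = soft_threshold(rho, lam) / scale
    return coefs


# Hypothetical data: y depends strongly on x_strong and only weakly on x_weak.
x_strong = [-2.0, -1.0, 0.0, 1.0, 2.0]
x_weak = [1.0, -2.0, 0.0, 2.0, -1.0]
y = [-5.9, -3.2, 0.0, 3.2, 5.9]
coefs = lasso_coordinate_descent([x_strong, x_weak], y, lam=0.5)
print(coefs)
```

The weak predictor's coefficient is set to exactly zero while the strong one is only shrunk. This is the mechanism by which LASSO can exclude a variable such as wip, which forward selection kept but which is absent from the LASSO selection above.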
Figure 3.10: Residuals of Selected Model with LASSO
Figure 3.11: QQ-plot of Selected Model with LASSO
Much like the model chosen by forward selection, the residual plot for the model selected by LASSO suggests unequal variance, and the Q-Q plot above suggests a violation of the normality assumption. Attempts to mitigate this issue have failed, meaning we will have to continue our analysis under very strong assumptions about our data.
Adjusted \(R^2\) for Selected Model: 0.2294
| Res.Df | RSS | Df | Sum.of.Sq | F | Pr..F. |
|---|---|---|---|---|---|
| 891 | 20.44755 | NA | NA | NA | NA |
| 885 | 20.29409 | 6 | 0.1534619 | 1.11538 | 0.3512442 |
Since the p-value is 0.351, at a 5% significance level we fail to reject the null hypothesis that the coefficients of the six variables excluded by LASSO are all zero. There is therefore not sufficient evidence that the full model fits the data better than the model selected by LASSO.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.2655908 | 0.0788001 | 3.3704351 | 0.0008508 |
| targeted_productivity | 0.6735619 | 0.1052595 | 6.3990577 | 0.0000000 |
| smv | -0.0008335 | 0.0009055 | -0.9204962 | 0.3580682 |
| incentive | 0.0000573 | 0.0000465 | 1.2312795 | 0.2192024 |
| idle_men | -0.0099995 | 0.0030348 | -3.2949306 | 0.0011050 |
| no_of_style_change | -0.0297716 | 0.0243679 | -1.2217536 | 0.2227795 |
Adjusted \(R^2\) for Selected Model Using Testing Data: 0.1694
The adjusted \(R^2\) suggests that about 16.9% of the variation in the response is explained by the model after adjusting for the number of predictors. Compared with the 17% obtained above, this indicates that forward selection performed slightly better in producing an inference model for the actual productivity of the factory.
The variables selected from forward selection in our model were: targeted_productivity, smv, wip, incentive, idle_men, and no_of_style_change. The variables selected from LASSO in our model were: targeted_productivity, smv, incentive, idle_men, and no_of_style_change.
Both models produced fairly poor adjusted \(R^2\) values (0.1702 for forward selection and 0.1694 for LASSO) when evaluated on the testing data. Additionally, neither of the selected models was significantly different from the full model according to the corresponding F-tests.
The relatively poor performance of both selected models and the non-significant results from the corresponding F-tests may be due to the assumptions made throughout our analysis. The techniques learned in this class were likely not able to overcome the limitations and assumptions made throughout our analysis. As shown by the various model assumption plots throughout the analysis, the assumptions of equal variance and normality were used when analyzing the response variable, as well as both of the models selected via forward selection and LASSO.
The assumption of normality was required for the F-tests used near the end of our analysis. Because the diagnostic plots shown earlier indicate some violation of this assumption, our results may be affected when testing whether the models we selected were statistically different from the full model.
Another factor that may have contributed to the lack of a significant difference from the full model is that the response variable (actual productivity) lies in a narrow range, roughly 0 to 1, while our explanatory variables have much broader ranges. In addition, the variables wip, incentive, idle_time, and idle_men contained a large number of zeros along with a few large observed values, leading to highly skewed distributions.
These issues regarding normality and heteroskedasticity in our data and models may have resulted in inaccurate standard errors, and thus in less reliable inference about the coefficient estimates, as well as inaccurate p-values and F-statistics. These limitations may have affected the statistical significance of our results.
The variables selected by both forward selection and LASSO were targeted_productivity, smv, incentive, idle_men, and no_of_style_change. The models suggest that the daily targeted productivity, allocated time for a task, financial incentive, number of idle workers, and number of style changes have the strongest association with actual productivity. These findings could inform the business decisions of those in management and leadership positions, who could potentially adjust each variable to drive higher productivity and thus profit. For example, a higher monetary incentive per item may motivate workers to be more efficient and could have higher payoffs overall, though one should be cautious of unsatisfactory work.
The study “Enhancing Efficiency and Productivity of Garment Industry by Using Different Techniques” describes methods of increasing productivity by eliminating factors such as idle time, which is related to the idle_men variable in our model. These methods include time study, implementing a visual management system, and standardized work procedures, which increased efficiency by 8.07% (Rajput et al. 2018). Focusing on decreasing the number of style changes through process management can also increase productivity, since shorter set-up and transition times between patterns reduce wasted time. “The Effect of Lean on Occupational Health and Safety and Productivity in the Garment Industry” outlines lean methodology, in which waste is minimized by reducing variability on all fronts of production (Hamja, Maalouf, and Hasle 2019). While there is evidence of positive effects on productivity from using lean, the literature also points to a potential negative impact on workers’ health, which morally outweighs monetary profit (Hamja, Maalouf, and Hasle 2019).
This study prompts further questions about the garment industry.
- How can we improve these variables to achieve more efficient productivity?
- What is the threshold at which productivity is maximized?
- What other variables outside of this dataset, particularly involving technological innovation, affect productivity?
- What is the environmental impact of increasing productivity?